Skip to content

Conversation

jbrockmendel
Copy link
Member

cc @WillAyd I'm still seeing 88 test failures locally and could use a fresh pair of eyes on this. Any thoughts?

@jbrockmendel
Copy link
Member Author

Tentatively looks like the issue is that this patches _index_data without updating _data

@WillAyd
Copy link
Member

WillAyd commented Jul 8, 2020

If it helps I've noticed that changing this line:

chunk = slider.dummy

To chunk = slider.frame[starts[i]:ends[i]].copy() takes 88 failures down to about 20. Not a real solution as we don't want the copy, but I think this new design isn't fully compatible with some of the dummy work that the sliders do, so might need to patch together

Seems heading in the right direction though

@jbrockmendel
Copy link
Member Author

closing to clear the queue

@WillAyd WillAyd reopened this Jul 22, 2020
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel made a few updates to get failures down to a small handful. may or may not be the right way of doing things but hopefully continues this conversation

@WillAyd
Copy link
Member

WillAyd commented Jul 22, 2020

Of the four remaining failures one of them is a result of groupby.tshift which maybe we will deprecate in #34452

@jbrockmendel
Copy link
Member Author

@WillAyd is there cause for optimism here? A solution here might have a bearing on one of the test failures in #35417

@WillAyd
Copy link
Member

WillAyd commented Jul 28, 2020

Well the optimism is that we are down to a handful of failures :-)

I think though that this is uncovering deep seeded issues in other parts of the code base. For instance, this is causing the test that #33439 added to fail which right now I've traced back to a potential bug somewhere in our hashing ocde, but I think it's only happen stance that that test works in the first place.

Not sure if the other 4 failures are related to hashing as well but will hopefully figure out soon

@WillAyd
Copy link
Member

WillAyd commented Jul 29, 2020

I've noticed that as an apply slides along various groups that this line of code looks problematic with the current changes:

values = self._get_index_values()

The problem is that the call to _get_index_values hits a cached property that with this doesn't seem to be updated, and only ever references the index of the first group. So if you tried to split a DataFrame with 3 elements into [0], and [1, 2] groupings, a hashtable only ever gets built for the [0] group, the next iteration throws a KeyError when trying to figure out the location of [1, 2] and a whole slew of exception handling routes that off to space from there

@jbrockmendel any idea how the caching should be handled here? In the Python space the above call traces back to here:

target_values = self._get_engine_target()

@WillAyd
Copy link
Member

WillAyd commented Jul 29, 2020

Here's the MRE I'm using to tackle the above comment.

df = pd.DataFrame({"A": ["S", "W", "W"], "B": [1.0, 1.0, 2.0]}) 
res = df.groupby("A").agg({"B": lambda x: x.get(x.index[-1])})

On master this yields

     B
A     
S  1.0
W  2.0

But with this PR yields

     B
A     
S  1.0
W  NaN

Because of the aforementioned KeyError when trying to locate elements by index in the second grouping

@jbrockmendel
Copy link
Member Author

The problem is that the call to _get_index_values hits a cached property

I'm not sure I follow. In index.pyx this is a call to self.vgetter(), which I think is supposed to return _index_data, which gets patched within libreduction. Am I misunderstanding?

cache_readonlys on the Index object definitely seem like a footgun. In the general case, the only way I can think of to avoid this is to create a fresh Index object, but avoiding that overhead is basically the whole point of the shenanigans in libreduction.

@WillAyd
Copy link
Member

WillAyd commented Jul 29, 2020

Where is the call to _index_data? That would seem related but I wasn't seeing that; could be the missing link

@jbrockmendel
Copy link
Member Author

Where is the call to _index_data? That would seem related but I wasn't seeing that; could be the missing link

I expected to see it in Index._engine, but apparently not

@WillAyd
Copy link
Member

WillAyd commented Aug 4, 2020

/azp run

@azure-pipelines
Copy link
Contributor

Azure Pipelines successfully started running 1 pipeline(s).

@jbrockmendel jbrockmendel mentioned this pull request Aug 11, 2020
5 tasks
@jbrockmendel
Copy link
Member Author

Just pushed after applying a patch from #35417. Locally that gets me to 3 test failures (which now that i reread the thread may not actually be an improvement)

@jbrockmendel jbrockmendel mentioned this pull request Sep 12, 2020
2 tasks
@jbrockmendel
Copy link
Member Author

Updated to implement NDFrame._can_use_libreduction to de-duplicate a bunch of similar (and inconsistently strict) checks done elsewhere. (Note an inline-comment I'll make about a place where this PR retains a different, inconsistent check)

This fixes an xfailed tests.groupby.test_apply.test_apply_with_timezones_aware

# see see test_groupby.test_basic
result = self._aggregate_named(func, *args, **kwargs)
if isinstance(
self._selected_obj.index, (DatetimeIndex, TimedeltaIndex, PeriodIndex)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the one place where i'm not using self._selected_obj._can_use_libreduction, as doing so would require 2.5 more kludges to get the tests passing:

  1. below on 283 after ret = create_series_with_explicit_dtype would need to do
            # Inference in the Series constructor may not infer
            #  custom EA dtypes, so try here
            ret = maybe_cast_result(ret._values, obj, numeric_only=True)
            ret = Series(ret, index=index, name=obj.name)

1.5) in create_series_with_explicit_dtype would need to change dtype_if_empty=object to dtype_if_empty=obj.dtype (which im not 100% sure about)

  1. The kludge here would have to be amended from and name in output.index to and (name in output.index or 0 in output.index)

@jbrockmendel jbrockmendel marked this pull request as ready for review September 17, 2020 21:10
@jbrockmendel
Copy link
Member Author

Closing.

This doesn't do quite what I thought it did, will need a new approach, xref #36459.

@WillAyd
Copy link
Member

WillAyd commented Sep 19, 2020

This thing is super thorny...nice effort in any case here. We will figure it out one of these days

@jbrockmendel jbrockmendel deleted the ref-libreduction-5 branch March 2, 2021 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Internals Related to non-user accessible pandas implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.